frequency count
Density Estimation via Measure Transport: Outlook for Applications in the Biological Sciences
Lopez-Marrero, Vanessa, Johnstone, Patrick R., Park, Gilchan, Luo, Xihaier
The problem of estimating a probability density from samples (e.g., observations, measurements, or simulation data) is ubiquitous in data science, uncertainty quantification, clustering and classification, and probabilistic modeling and inference tasks. Moreover, it is common among various scientific and engineering fields, including biology [38, 14, 1, 41, 5, 7, 12, 39]. Often, well-known parametric density functions (dependent on few parameters), such as the Gaussian or Weibull densities, are adopted. While this may simplify certain tasks (e.g., computational ones), many of these known density functions are not necessarily suitable for characterizing data that exhibit complex features, such as (spatial and/or temporal) correlations and non-Gaussian characteristics. For instance, as reported in [7], accounting for differences in the distribution densities of gene expressions can lead to improved interpretation of cancer transcriptomic data. Hence, a density estimation framework capable of characterizing a diverse range of properties is highly desirable. A measure transport approach [44, 37, 36] offers this possibility. Optimal measure transport, broadly defined, deals with the problem of minimizing the cost of transporting one (probability) measure to another.
GitHub - explosion/sense2vec: 🦆 Contextually-keyed word vectors
This library is a simple Python implementation for loading, querying and training sense2vec models. To explore the semantic similarities across all Reddit comments of 2015 and 2019, see the interactive demo. Note that this example describes usage with spaCy v3. To try out our pretrained vectors trained on Reddit comments, check out the interactive sense2vec demo. This repo also includes a Streamlit demo script for exploring vectors and the most similar phrases.
Using Machine Learning to Classify Tweets
I recently had the opportunity to take on a project with Inspirit AI where I worked with a team to use machine learning to classify whether tweets were considered positive, negative, or neutral as they related to different stocks. To do that, we explored three different machine learning models for classifying text: bag-of-words, long short-term memory (LSTM), and bidirectional encoder representations from transformers (BERT). Here I describe our experience in solving this problem and highlight what we learned from using the different methods. The bag-of-words model works by treating each text as a bag of words and recording a frequency count of how often each word is used. The frequency counts can then be used as features in a machine learning model.
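As a rough illustration of the frequency-count idea described above, here is a minimal Python sketch. It is not the project's actual code: the tokenizer, the function names `bag_of_words` and `vectorize`, and the two sample tweets are all assumptions made for the example.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, keep runs of letters/apostrophes, count each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def vectorize(texts):
    """Turn each text into a row of counts over a shared, sorted vocabulary."""
    counts = [bag_of_words(t) for t in texts]
    vocab = sorted(set().union(*counts))
    rows = [[c[word] for word in vocab] for c in counts]
    return vocab, rows


# Hypothetical sample tweets, used only to show the shape of the features.
tweets = ["Stock is up, up, up", "Stock is down"]
vocab, X = vectorize(tweets)
print(vocab)  # ['down', 'is', 'stock', 'up']
print(X)      # [[0, 1, 1, 3], [1, 1, 1, 0]]
```

Each row of `X` could then be passed to a classifier as a feature vector. Word order is discarded, which is the main limitation of this representation compared with sequence models such as LSTM or BERT.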
Predicting Antimicrobial Resistance in the Intensive Care Unit
Wang, Taiyao, Hansen, Kyle R., Loving, Joshua, Paschalidis, Ioannis Ch., van Aggelen, Helen, Simhon, Eran
Antimicrobial resistance (AMR) is a risk for patients and a burden for the healthcare system. However, AMR assays typically take several days. This study develops predictive models for AMR based on easily available clinical and microbiological predictors, including patient demographics, hospital stay data, diagnoses, clinical features, and microbiological/antimicrobial characteristics, and compares those models to a naive antibiogram-based model using only microbiological/antimicrobial characteristics. The ability to predict resistance accurately prior to culturing could inform clinical decision-making and shorten time to action. The machine learning algorithms employed here show improved classification performance (area under the receiver operating characteristic curve 0.88-0.89) versus the naive model (area under the receiver operating characteristic curve 0.86) for 6 organisms and 10 antibiotics using the Philips eICU Research Institute (eRI) database. This method can help guide antimicrobial treatment, with the objective of improving patient outcomes and reducing the usage of unnecessary or ineffective antibiotics.
Semantic Understanding for Contextual In-Video Advertising
Madhok, Rishi (Delhi Technological University) | Mujumdar, Shashank (IBM Research, India) | Gupta, Nitin (IBM Research, India) | Mehta, Sameep (IBM Research, India)
With the increasing consumer base of online video content, it is important for advertisers to understand the video context when targeting video ads to consumers. To improve the consumer experience and quality of ads, key factors need to be considered, such as (i) ad relevance to video content, (ii) where and how video ads are placed, and (iii) non-intrusive user experience. We propose a framework to semantically understand the video content for better ad recommendation that ensures these criteria are met.
Clouds, clouds, and more clouds
There are at least eleven kinds of clouds: cirrus, cirrocumulus, cirrostratus, altocumulus, altostratus, cumulonimbus, cumulus, nimbostratus, stratocumulus, small Cu, and stratus. But this article is not about those kinds of clouds. Of course there are other kinds of clouds, like iCloud, Google Cloud, Azure Cloud, Amazon Cloud, and the list goes on. But this article is not about those clouds either. This article is about text analytics.
The Revisiting Problem in Mobile Robot Map Building: A Hierarchical Bayesian Approach
Stewart, Benjamin, Ko, Jonathan, Fox, Dieter, Konolige, Kurt
We present an application of hierarchical Bayesian estimation to robot map building. The revisiting problem occurs when a robot has to decide whether it is seeing a previously-built portion of a map, or is exploring new territory. This is a difficult decision problem, requiring the probability of being outside of the current known map. To estimate this probability, we model the structure of a "typical" environment as a hidden Markov model that generates sequences of views observed by a robot navigating through the environment. A Dirichlet prior over structural models is learned from previously explored environments. Whenever a robot explores a new environment, the posterior over the model is estimated by Dirichlet hyperparameters. Our approach is implemented and tested in the context of multi-robot map merging, a particularly difficult instance of the revisiting problem. Experiments with robot data show that the technique yields strong improvements over alternative methods.
Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering
The k-modes algorithm [1] extends the k-means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k-means fashion to minimize the cost function of clustering. Because the k-modes algorithm uses the same clustering process as k-means, it preserves the efficiency of the k-means algorithm. Although the k-modes algorithm is very efficient, it suffers from the problem that the clustering results are sensitive to the selection of the initial points. Hence, a better initial-points selection procedure would improve the reliability and accuracy of clustering results. To that end, an iterative initial-points refinement algorithm for k-modes clustering has been presented in [2]. As shown in [2], the new initialization procedure greatly improves the reliability and accuracy of final clustering results. Despite the success of Ref. [2], the following observations motivate us to further pursue other alternative initialization methods.
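The three ingredients of k-modes listed above can be sketched in a few lines of Python. This is an illustrative sketch only, not the code of [1] or [2]. The function names and the toy data are assumptions, and ties between equally frequent values are broken by first occurrence, which is one of several possible conventions.

```python
from collections import Counter


def matching_dissimilarity(x, y):
    """Simple matching: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def update_mode(cluster):
    """Frequency-based update: most frequent value of each attribute."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*cluster)]


def assign(obj, modes):
    """Index of the mode closest to obj under simple matching."""
    return min(range(len(modes)),
               key=lambda i: matching_dissimilarity(obj, modes[i]))


# Hypothetical categorical objects with three attributes each.
cluster = [["a", "p", "w"], ["a", "q", "w"], ["b", "p", "w"]]
mode = update_mode(cluster)
print(mode)  # ['a', 'p', 'w']
print(matching_dissimilarity(["a", "p", "w"], ["b", "p", "x"]))  # 2
print(assign(["a", "q", "w"], [["a", "p", "w"], ["b", "q", "x"]]))  # 0
```

A full k-modes run would alternate `assign` over all objects with `update_mode` over each cluster until assignments stop changing. The choice of the starting modes, which the abstract identifies as the sensitive step, is not shown here.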
Evaluating Variable Length Markov Chain Models for Analysis of User Web Navigation Sessions
Markov models have been widely used to represent and analyse user web navigation data. In previous work we have proposed a method to dynamically extend the order of a Markov chain model and a complementary method for assessing the predictive power of such a variable length Markov chain. Herein, we review these two methods and propose a novel method for measuring the ability of a variable length Markov model to summarise user web navigation sessions up to a given length. While the summarisation ability of a model is important to enable the identification of user navigation patterns, the ability to make predictions is important in order to foresee the next link choice of a user after following a given trail so as, for example, to personalise a web site. We present an extensive experimental evaluation providing strong evidence that prediction accuracy increases linearly with summarisation ability.
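To make the modelling idea concrete, the following Python sketch estimates a plain first-order Markov chain from navigation sessions and uses it to predict the next page. This is a deliberate simplification and not the authors' variable length method, which extends the order dynamically. The function names and the sample sessions are assumptions.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from observed page sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {
        page: {nxt: n / sum(c.values()) for nxt, n in c.items()}
        for page, c in counts.items()
    }


def predict_next(probs, page):
    """Most probable next page, or None if the page was never left."""
    following = probs.get(page)
    return max(following, key=following.get) if following else None


# Hypothetical sessions, each a list of visited pages in order.
sessions = [
    ["home", "news", "sport"],
    ["home", "news", "weather"],
    ["home", "about"],
    ["home", "news", "sport"],
]
probs = transition_probabilities(sessions)
print(probs["home"]["news"])        # 0.75
print(predict_next(probs, "news"))  # sport
```

A higher-order or variable length model would condition on the last several pages of the trail rather than only the current one, at the cost of many more states to estimate.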
Attribute Value Weighting in K-Modes Clustering
He, Zengyou, Xu, Xiaofei, Deng, Shengchun
Categorical data clustering is an important research problem in pattern recognition and data mining. The k-modes algorithm [1] extends the k-means paradigm to cluster categorical data by using (1) a simple matching dissimilarity measure for categorical objects, (2) modes instead of means for clusters, and (3) a frequency-based method to update modes in the k-means fashion to minimize the cost function of clustering. The k-modes algorithm is widely used in real-world applications due to its efficiency in dealing with large categorical databases. In the standard k-modes algorithm, a simple matching dissimilarity measure is used, in which the distance between two attribute values is either 0 or 1. Such a simple matching dissimilarity measure does not consider the implicit similarity relationship embedded in categorical values, which results in weaker intra-cluster similarity because less similar objects are allocated to the cluster. To illustrate this fact, let us consider the following example shown in Fig. 1. Example 1: In this artificial example, the dataset is described with 3 categorical attributes A1, A2, and A3, and there are two clusters with their modes. Assume that we have to allocate a data object Y = [a, p, w] to either cluster 1 or cluster 2. According to the k-modes algorithm, we can assign Y to either cluster 1 or cluster 2 since these two clusters have the same mode. However, from the viewpoint of intra-cluster similarity, it is more desirable to allocate Y to cluster 1.